[Feat] [history server] Add actor task endpoint#4463

Open
JiangJiaWei1103 wants to merge 50 commits intoray-project:masterfrom
JiangJiaWei1103:epic-4374/add-actor-task-endpoint
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 30, 2026

Why are these changes needed?

This PR adds support for the /api/v0/tasks endpoint to the history server and makes the data structure definitions and event-processing logic compatible with ACTOR_TASK.

NOTE: The alpha version does not support historical replay of task state transitions. All lifecycle-related fields are therefore downgraded to the last snapshot (i.e., lifecycle-related fields are overridden in place).

Change Summary

At a high level, this PR introduces changes across two main layers:

History Server Layer

  • Implement the core logic for the /api/v0/tasks endpoint
    • Retrieve the target task attempts from the event server layer
    • Filter tasks by query string options exclude_driver and filter triple (key, predicate, value)
    • Construct and format task information of all task attempts for API responses
      • Support detail mode by query string option detail
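
A minimal sketch of the filtering step described above, assuming a trimmed-down `Task` struct and only the `=` and `!=` predicates. The names `Task`, `Filter`, and `FilterTasks` are hypothetical and are not taken from this PR:

```go
package main

import "fmt"

// Task is a hypothetical, trimmed-down task record used only for this sketch.
type Task struct {
	TaskID string
	Name   string
	State  string
	Type   string
}

// Filter is one (key, predicate, value) triple from the query string.
type Filter struct {
	Key, Predicate, Value string
}

// field returns the value of the named field; ok is false for unknown keys.
func field(t Task, key string) (value string, ok bool) {
	switch key {
	case "task_id":
		return t.TaskID, true
	case "name":
		return t.Name, true
	case "state":
		return t.State, true
	case "type":
		return t.Type, true
	}
	return "", false
}

// matches reports whether t satisfies f. Only "=" and "!=" are handled here.
func matches(t Task, f Filter) bool {
	v, ok := field(t, f.Key)
	if !ok {
		return false
	}
	switch f.Predicate {
	case "=":
		return v == f.Value
	case "!=":
		return v != f.Value
	}
	return false
}

// FilterTasks drops driver tasks when excludeDriver is set, then keeps only
// the tasks that satisfy every filter triple.
func FilterTasks(tasks []Task, excludeDriver bool, filters []Filter) []Task {
	out := make([]Task, 0, len(tasks))
	for _, t := range tasks {
		if excludeDriver && t.Type == "DRIVER_TASK" {
			continue
		}
		keep := true
		for _, f := range filters {
			if !matches(t, f) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	tasks := []Task{
		{TaskID: "a", State: "RUNNING", Type: "DRIVER_TASK"},
		{TaskID: "b", Name: "my_task", State: "FINISHED", Type: "NORMAL_TASK"},
		{TaskID: "c", Name: "Counter.increment", State: "FINISHED", Type: "ACTOR_TASK"},
	}
	got := FilterTasks(tasks, true, []Filter{{Key: "type", Predicate: "=", Value: "ACTOR_TASK"}})
	fmt.Printf("%+v\n", got)
}
```

All filter triples are combined with AND in this sketch, so an empty filter list keeps every non-driver task.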

Event Server Layer

  • Define task-related data structures and public interfaces to different kinds of tasks
  • Refactor event processing for TASK_DEFINITION_EVENT, ACTOR_TASK_DEFINITION_EVENT and TASK_LIFECYCLE_EVENT
  • Build and maintain a TaskMap, and expose a GetTasks helper for consumption by the history server layer
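
As an illustration of the TaskMap idea above, here is a small sketch of a mutex-guarded map keyed by (task ID, attempt number). The method names `CreateOrMergeAttempt` and `GetTasks` follow this PR's description, but the struct layout and locking strategy are assumptions made for the example, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical, trimmed-down record for one task attempt.
type Task struct {
	TaskID  string
	Attempt int
	Name    string
	State   string
}

// attemptKey identifies a single attempt of a task.
type attemptKey struct {
	taskID  string
	attempt int
}

// TaskMap stores task attempts keyed by (task ID, attempt number).
type TaskMap struct {
	mu    sync.RWMutex
	tasks map[attemptKey]*Task
}

// NewTaskMap returns an empty, ready-to-use TaskMap.
func NewTaskMap() *TaskMap {
	return &TaskMap{tasks: make(map[attemptKey]*Task)}
}

// CreateOrMergeAttempt finds or creates the attempt and lets merge mutate it
// while the write lock is held.
func (m *TaskMap) CreateOrMergeAttempt(taskID string, attempt int, merge func(*Task)) {
	m.mu.Lock()
	defer m.mu.Unlock()
	k := attemptKey{taskID, attempt}
	t, ok := m.tasks[k]
	if !ok {
		t = &Task{TaskID: taskID, Attempt: attempt}
		m.tasks[k] = t
	}
	merge(t)
}

// GetTasks returns copies so callers cannot mutate the stored attempts.
func (m *TaskMap) GetTasks() []Task {
	m.mu.RLock()
	defer m.mu.RUnlock()
	out := make([]Task, 0, len(m.tasks))
	for _, t := range m.tasks {
		out = append(out, *t)
	}
	return out
}

func main() {
	m := NewTaskMap()
	// A definition event sets the name; a later lifecycle event sets the state.
	m.CreateOrMergeAttempt("t1", 0, func(t *Task) { t.Name = "my_task" })
	m.CreateOrMergeAttempt("t1", 0, func(t *Task) { t.State = "FINISHED" })
	fmt.Printf("%+v\n", m.GetTasks())
}
```

Passing a merge callback keeps the find-or-create step and the field update under one lock, so definition and lifecycle events can arrive in either order.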

Test Result

(Screenshot of the test result, 2026-01-31 10:28:49 PM)

Related issue number

Closes #4388.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
…types

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review January 31, 2026 14:32
@JiangJiaWei1103 JiangJiaWei1103 moved this to In review in My Kuberay & Ray Jan 31, 2026

taskMap := h.ClusterTaskMap.GetOrCreateTaskMap(clusterSessionKey)
taskMap.CreateOrMergeAttempt(currTask.TaskID, currTask.TaskAttempt, func(task *types.Task) {
// --- DEDUPLICATION using (State + Timestamp) as unique key ---
Contributor Author

Deduplication logic can be reused from here after the node endpoint PR is merged.
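
For readers following along, the (State + Timestamp) key mentioned in the quoted code comment could look roughly like the sketch below. `StateEvent` and `MergeEvents` are hypothetical names; the actual deduplication code in this PR may differ:

```go
package main

import "fmt"

// StateEvent is a hypothetical lifecycle entry: a state name plus a timestamp.
type StateEvent struct {
	State     string
	CreatedMs int64
}

// MergeEvents appends incoming events to existing ones, skipping any event
// whose (State, CreatedMs) pair has already been seen. Order is preserved.
func MergeEvents(existing, incoming []StateEvent) []StateEvent {
	seen := make(map[StateEvent]struct{}, len(existing)+len(incoming))
	out := make([]StateEvent, 0, len(existing)+len(incoming))
	for _, group := range [][]StateEvent{existing, incoming} {
		for _, e := range group {
			if _, dup := seen[e]; dup {
				continue
			}
			seen[e] = struct{}{}
			out = append(out, e)
		}
	}
	return out
}

func main() {
	existing := []StateEvent{{"PENDING_ARGS_AVAIL", 100}, {"RUNNING", 120}}
	incoming := []StateEvent{{"RUNNING", 120}, {"FINISHED", 150}}
	fmt.Println(MergeEvents(existing, incoming))
}
```

Because the struct has only comparable fields, it can be used directly as the map key, which avoids building a string key by hand.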

Comment on lines 943 to 945
// TODO(jwj): Support profiling_data after TASK_PROFILE_EVENT is supported.
// Ref: https://github.com/ray-project/ray/blob/d0b1d151d8ea964a711e451d0ae736f8bf95b629/python/ray/util/state/common.py#L1616-L1622.
// "profiling_data": task.ProfilingData,
Contributor Author

Will support profiling_data after TASK_PROFILE_EVENT is handled.

1. Filter tasks with exclude_driver and filter triple
2. Limit task number by limit
3. Filter fields by detail
4. Timeout is not supported, since tasks are held in memory
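
Steps 2 and 3 above can be sketched as follows. The field split between the default and detail views is only illustrative (it mirrors `error_message` appearing in the `detail=1` payloads in this thread but not in the default ones), and `Limit` and `Format` are hypothetical helper names:

```go
package main

import "fmt"

// Task is a hypothetical record; only a few fields are shown.
type Task struct {
	TaskID       string
	Name         string
	State        string
	ErrorMessage string
}

// Limit truncates tasks to at most limit entries; a non-positive limit keeps all.
func Limit(tasks []Task, limit int) []Task {
	if limit > 0 && len(tasks) > limit {
		return tasks[:limit]
	}
	return tasks
}

// Format builds the response object for one task. Detail-only fields are
// added only when detail is true.
func Format(t Task, detail bool) map[string]any {
	out := map[string]any{
		"task_id": t.TaskID,
		"name":    t.Name,
		"state":   t.State,
	}
	if detail {
		out["error_message"] = t.ErrorMessage
	}
	return out
}

func main() {
	tasks := []Task{{TaskID: "a"}, {TaskID: "b"}, {TaskID: "c"}}
	for _, t := range Limit(tasks, 2) {
		fmt.Println(Format(t, false))
	}
}
```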

Member

win5923 commented Feb 2, 2026

Live cluster:

{
	"result": true,
	"msg": "",
	"data": {
		"result": {
			"total": 4,
			"num_after_truncation": 4,
			"num_filtered": 4,
			"result": [
				{
					"task_id": "39088be3736e590a551342f87f88337185f3c5d202000000",
					"name": "Counter.get_count",
					"attempt_number": 0,
					"func_or_class_name": "Counter.get_count",
					"actor_id": "551342f87f88337185f3c5d202000000",
					"worker_pid": 294,
					"worker_id": "a89649063d8196704d484c9ebcc71b649ed319e2783ad9667c5afd5f",
					"error_type": null,
					"node_id": "c105c0dde5e88ec9ecd0417de065e5cb32eb0cbcbb1239988cc5c636",
					"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
					"job_id": "02000000",
					"state": "FINISHED",
					"type": "ACTOR_TASK"
				},
				{
					"task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000",
					"name": "my_task",
					"attempt_number": 0,
					"func_or_class_name": "my_task",
					"actor_id": null,
					"worker_pid": 293,
					"worker_id": "c10259db5790b44dd4a01f18af411a2c0e334154d8e3261661d3f59c",
					"error_type": null,
					"node_id": "c105c0dde5e88ec9ecd0417de065e5cb32eb0cbcbb1239988cc5c636",
					"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
					"job_id": "02000000",
					"state": "FINISHED",
					"type": "NORMAL_TASK"
				},
				{
					"task_id": "e5cbd90b7f1fb776551342f87f88337185f3c5d202000000",
					"name": "Counter.increment",
					"attempt_number": 0,
					"func_or_class_name": "Counter.increment",
					"actor_id": "551342f87f88337185f3c5d202000000",
					"worker_pid": 294,
					"worker_id": "a89649063d8196704d484c9ebcc71b649ed319e2783ad9667c5afd5f",
					"error_type": null,
					"node_id": "c105c0dde5e88ec9ecd0417de065e5cb32eb0cbcbb1239988cc5c636",
					"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
					"job_id": "02000000",
					"state": "FINISHED",
					"type": "ACTOR_TASK"
				},
				{
					"task_id": "ffffffffffffffff551342f87f88337185f3c5d202000000",
					"name": "Counter.__init__",
					"attempt_number": 0,
					"func_or_class_name": "Counter.__init__",
					"actor_id": "551342f87f88337185f3c5d202000000",
					"worker_pid": 294,
					"worker_id": null,
					"error_type": null,
					"node_id": null,
					"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
					"job_id": "02000000",
					"state": "FINISHED",
					"type": "ACTOR_CREATION_TASK"
				}
			],
			"partial_failure_warning": "",
			"warnings": null
		}
	}
}

Dead Cluster:

{
	"result": true,
	"msg": "",
	"data": {
		"result": {
			"total": 6,
			"num_after_truncation": 6,
			"num_filtered": 6,
			"result": [
				{
					"actor_id": "",
					"attempt_number": 0,
					"error_type": "",
					"func_or_class_name": "",
					"job_id": "",
					"name": "",
					"node_id": "",
					"parent_task_id": "////////////////////////////////",
					"state": "RUNNING",
					"task_id": "//////////////////////////8BAAAA",
					"type": "DRIVER_TASK",
					"worker_id": "",
					"worker_pid": 0
				},
				{
					"actor_id": "",
					"attempt_number": 0,
					"error_type": "",
					"func_or_class_name": "",
					"job_id": "",
					"name": "",
					"node_id": "",
					"parent_task_id": "////////////////////////////////",
					"state": "FINISHED",
					"task_id": "//////////////////////////8CAAAA",
					"type": "DRIVER_TASK",
					"worker_id": "",
					"worker_pid": 0
				},
				{
					"actor_id": "",
					"attempt_number": 0,
					"error_type": "",
					"func_or_class_name": "Counter.__init__",
					"job_id": "AgAAAA==",
					"name": "Counter.__init__",
					"node_id": "",
					"parent_task_id": "//////////////////////////8CAAAA",
					"state": "FINISHED",
					"task_id": "//////////9VE0L4f4gzcYXzxdICAAAA",
					"type": "ACTOR_CREATION_TASK",
					"worker_id": "",
					"worker_pid": 294
				},
				{
					"actor_id": "VRNC+H+IM3GF88XSAgAAAA==",
					"attempt_number": 0,
					"error_type": "",
					"func_or_class_name": "Counter.increment",
					"job_id": "AgAAAA==",
					"name": "Counter.increment",
					"node_id": "",
					"parent_task_id": "//////////////////////////8CAAAA",
					"state": "FINISHED",
					"task_id": "5cvZC38ft3ZVE0L4f4gzcYXzxdICAAAA",
					"type": "ACTOR_TASK",
					"worker_id": "",
					"worker_pid": 294
				},
				{
					"actor_id": "VRNC+H+IM3GF88XSAgAAAA==",
					"attempt_number": 0,
					"error_type": "",
					"func_or_class_name": "Counter.get_count",
					"job_id": "AgAAAA==",
					"name": "Counter.get_count",
					"node_id": "",
					"parent_task_id": "//////////////////////////8CAAAA",
					"state": "FINISHED",
					"task_id": "OQiL43NuWQpVE0L4f4gzcYXzxdICAAAA",
					"type": "ACTOR_TASK",
					"worker_id": "",
					"worker_pid": 294
				},
				{
					"actor_id": "",
					"attempt_number": 0,
					"error_type": "",
					"func_or_class_name": "my_task",
					"job_id": "AgAAAA==",
					"name": "my_task",
					"node_id": "",
					"parent_task_id": "//////////////////////////8CAAAA",
					"state": "FINISHED",
					"task_id": "Z6Loz6WgbbP///////////////8CAAAA",
					"type": "NORMAL_TASK",
					"worker_id": "",
					"worker_pid": 293
				}
			],
			"partial_failure_warning": "",
			"warnings": null
		}
	}
}

node_id and worker_id are empty in the dead-cluster response. Is that expected?
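
A side observation on the two payloads: the dead-cluster IDs look like base64 encodings of the same bytes that the live cluster prints as hex (for example, job_id "AgAAAA==" versus "02000000"). The sketch below only demonstrates that relationship with the standard library; `Base64ToHex` is a hypothetical helper, and whether and where the history server should do this conversion is not something this example decides:

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// Base64ToHex re-encodes a standard base64 string as lowercase hex.
func Base64ToHex(s string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

func main() {
	// Values copied from the dead-cluster response above.
	for _, id := range []string{"AgAAAA==", "VRNC+H+IM3GF88XSAgAAAA=="} {
		h, err := Base64ToHex(id)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Println(id, "->", h)
	}
}
```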

Comment on lines 37 to 40
func ParseOptionsFromReq(req *restful.Request) (ListAPIOptions, error) {
opts := ListAPIOptions{
Limit: RayMaxLimitFromDataSource,
}
Member

Contributor Author

@JiangJiaWei1103 JiangJiaWei1103 Feb 3, 2026

Thanks for the review! The changes in 4a4ab9b aim to align the behavior as closely as possible with Ray’s Dashboard.

I think there are still a few points worth further discussion:

  • limit: In Ray, users can configure a client-side limit via the RAY_MAX_LIMIT_FROM_API_SERVER environment variable. Should we consider supporting a similar mechanism in the history server?
  • timeout: Since the listing methods in the history server rely on in-memory maps, queries should be fast. Would it be reasonable to ignore the timeout setting in this case?
  • detail and exclude_driver: These now align with Ray’s default values.
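
To make the points above concrete, here is a hedged sketch of option parsing with defaults. It uses only the standard library instead of the `restful.Request` type quoted earlier, and the default values (`detail=false`, `exclude_driver=true`, a cap of 10000) are assumptions for illustration rather than values confirmed by this PR:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// assumedMaxLimit stands in for RayMaxLimitFromDataSource; the value is an assumption.
const assumedMaxLimit = 10000

// ListAPIOptions holds the query options discussed in this thread.
type ListAPIOptions struct {
	Limit         int
	Detail        bool
	ExcludeDriver bool
}

// ParseOptions parses a raw query string such as "detail=1&limit=5".
// The "timeout" option is deliberately ignored, since listing reads from memory.
func ParseOptions(rawQuery string) (ListAPIOptions, error) {
	opts := ListAPIOptions{Limit: assumedMaxLimit, Detail: false, ExcludeDriver: true}
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return opts, err
	}
	if v := q.Get("limit"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n <= 0 || n > assumedMaxLimit {
			return opts, fmt.Errorf("limit must be in [1, %d], got %q", assumedMaxLimit, v)
		}
		opts.Limit = n
	}
	bools := map[string]*bool{"detail": &opts.Detail, "exclude_driver": &opts.ExcludeDriver}
	for name, dst := range bools {
		if v := q.Get(name); v != "" {
			b, err := strconv.ParseBool(v)
			if err != nil {
				return opts, fmt.Errorf("invalid %s value %q", name, v)
			}
			*dst = b
		}
	}
	return opts, nil
}

func main() {
	opts, err := ParseOptions("detail=1&limit=5")
	fmt.Printf("%+v %v\n", opts, err)
}
```

`strconv.ParseBool` accepts `1`, `0`, `true`, and `false` among others, so `detail=1` as used later in this thread parses as true.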

Member

@win5923 win5923 Feb 3, 2026

limit: In Ray, users can configure a client-side limit via the RAY_MAX_LIMIT_FROM_API_SERVER environment variable. Should we consider supporting a similar mechanism in the history server?

Yes, we can address this together with the timeout in a follow-up. The same issue also shows up in api/v0/logs/file, so we can implement it after all endpoints are completed.
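
One possible shape for that follow-up, as a sketch only: read the cap from an environment variable and fall back to a built-in value when it is unset or invalid. Reusing the `RAY_MAX_LIMIT_FROM_API_SERVER` name in the history server, the fallback of 10000, and the `MaxLimit` helper itself are all assumptions:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// fallbackMaxLimit is an assumed built-in cap used when no override is set.
const fallbackMaxLimit = 10000

// MaxLimit returns the cap read through lookup, or fallback when the variable
// is unset, not a number, or not positive. lookup has the same signature as
// os.LookupEnv so it can be swapped out in tests.
func MaxLimit(lookup func(string) (string, bool), fallback int) int {
	v, ok := lookup("RAY_MAX_LIMIT_FROM_API_SERVER")
	if !ok {
		return fallback
	}
	n, err := strconv.Atoi(v)
	if err != nil || n <= 0 {
		return fallback
	}
	return n
}

func main() {
	fmt.Println(MaxLimit(os.LookupEnv, fallbackMaxLimit))
}
```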

#4471

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Uris *RuntimeEnvUris `json:"uris,omitempty"`
// The serialized runtime env config passed from the user.
RuntimeEnvConfig RuntimeEnvConfig `json:"runtimeEnvConfig"`
}

Entire runtime_environment.go file is unused

Medium Severity

The types RuntimeEnvInfo, RuntimeEnvConfig, and RuntimeEnvUris are defined but never used anywhere in the codebase. No code imports or references these types.

Contributor Author

Docs at a4604db: Note that these two fields are never populated on the Ray side.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Copilot AI mentioned this pull request Feb 5, 2026
4 tasks
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Contributor Author

JiangJiaWei1103 commented Feb 5, 2026

Summary

The following demonstrates the API schema of the /api/v0/tasks?detail=1 endpoint:

Live Cluster

{
  "result": true,
  "msg": "",
  "data": {
    "result": {
      "total": 4,
      "num_after_truncation": 4,
      "num_filtered": 4,
      "result": [
        {
          "attempt_number": 0,
          "language": "PYTHON",
          "job_id": "02000000",
          "events": [
            {
              "state": "PENDING_ARGS_AVAIL",
              "created_ms": 1770211318150
            },
            {
              "state": "PENDING_NODE_ASSIGNMENT",
              "created_ms": 1770211318150
            },
            {
              "state": "SUBMITTED_TO_WORKER",
              "created_ms": 1770211318150
            },
            {
              "state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY",
              "created_ms": 1770211318150
            },
            {
              "state": "RUNNING",
              "created_ms": 1770211318150
            },
            {
              "state": "FINISHED",
              "created_ms": 1770211318150
            }
          ],
          "required_resources": {},
          "func_or_class_name": "Counter.get_count",
          "task_log_info": null,
          "label_selector": {},
          "name": "Counter.get_count",
          "profiling_data": {
            "component_type": "worker",
            "component_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
            "node_ip_address": "10.244.0.48",
            "events": [
              {
                "start_time": 1770211318150.7988,
                "end_time": 1770211318150.8,
                "extra_data": {},
                "event_name": "task:deserialize_arguments"
              },
              {
                "start_time": 1770211318150.8047,
                "end_time": 1770211318150.82,
                "extra_data": {},
                "event_name": "task:execute"
              },
              {
                "start_time": 1770211318150.8213,
                "end_time": 1770211318150.853,
                "extra_data": {},
                "event_name": "task:store_outputs"
              },
              {
                "start_time": 1770211318150.784,
                "end_time": 1770211318150.8577,
                "extra_data": {
                  "name": "get_count",
                  "task_id": "39088be3736e590a051aa2759ceb4431ad03962e02000000"
                },
                "event_name": "task::Counter.get_count"
              }
            ]
          },
          "end_time_ms": 1770211318150,
          "state": "FINISHED",
          "is_debugger_paused": null,
          "call_site": null,
          "worker_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
          "type": "ACTOR_TASK",
          "error_type": null,
          "runtime_env_info": {
            "serialized_runtime_env": "{}",
            "runtime_env_config": {
              "setup_timeout_seconds": 600,
              "eager_install": true,
              "log_files": []
            }
          },
          "creation_time_ms": 1770211318150,
          "actor_id": "051aa2759ceb4431ad03962e02000000",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "worker_pid": 241,
          "placement_group_id": null,
          "start_time_ms": 1770211318150,
          "error_message": null,
          "task_id": "39088be3736e590a051aa2759ceb4431ad03962e02000000",
          "node_id": "49029ecd45f2139d33149e51b8732b5573bbca0f813e50145134d4db"
        },
        {
          "attempt_number": 0,
          "language": "PYTHON",
          "job_id": "02000000",
          "events": [
            {
              "state": "PENDING_ARGS_AVAIL",
              "created_ms": 1770211317649
            },
            {
              "state": "PENDING_NODE_ASSIGNMENT",
              "created_ms": 1770211317649
            },
            {
              "state": "SUBMITTED_TO_WORKER",
              "created_ms": 1770211317969
            },
            {
              "state": "RUNNING",
              "created_ms": 1770211317970
            },
            {
              "state": "FINISHED",
              "created_ms": 1770211318053
            }
          ],
          "required_resources": {
            "CPU": 0.5
          },
          "func_or_class_name": "my_task",
          "task_log_info": {
            "stdout_file": "/tmp/ray/session_2026-02-04_05-21-24_586619_1/logs/worker-2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab-02000000-242.out",
            "stderr_file": "/tmp/ray/session_2026-02-04_05-21-24_586619_1/logs/worker-2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab-02000000-242.err",
            "stdout_start": 36,
            "stdout_end": 36,
            "stderr_start": 36,
            "stderr_end": 36
          },
          "label_selector": {},
          "name": "my_task",
          "profiling_data": {
            "component_type": "worker",
            "component_id": "2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab",
            "node_ip_address": "10.244.0.48",
            "events": [
              {
                "start_time": 1770211317971.4631,
                "end_time": 1770211318052.6414,
                "extra_data": {},
                "event_name": "task:deserialize_arguments"
              },
              {
                "start_time": 1770211318052.6765,
                "end_time": 1770211318052.7,
                "extra_data": {},
                "event_name": "task:execute"
              },
              {
                "start_time": 1770211318052.7014,
                "end_time": 1770211318052.7334,
                "extra_data": {},
                "event_name": "task:store_outputs"
              },
              {
                "start_time": 1770211317971.4526,
                "end_time": 1770211318052.741,
                "extra_data": {
                  "name": "__main__.my_task",
                  "task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000"
                },
                "event_name": "task::my_task"
              }
            ]
          },
          "end_time_ms": 1770211318053,
          "state": "FINISHED",
          "is_debugger_paused": null,
          "call_site": null,
          "worker_id": "2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab",
          "type": "NORMAL_TASK",
          "error_type": null,
          "runtime_env_info": {
            "serialized_runtime_env": "{}",
            "runtime_env_config": {
              "setup_timeout_seconds": 600,
              "eager_install": true,
              "log_files": []
            }
          },
          "creation_time_ms": 1770211317649,
          "actor_id": null,
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "worker_pid": 242,
          "placement_group_id": null,
          "start_time_ms": 1770211317970,
          "error_message": null,
          "task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000",
          "node_id": "49029ecd45f2139d33149e51b8732b5573bbca0f813e50145134d4db"
        },
        {
          "attempt_number": 0,
          "language": "PYTHON",
          "job_id": "02000000",
          "events": [
            {
              "state": "PENDING_ARGS_AVAIL",
              "created_ms": 1770211318056
            },
            {
              "state": "PENDING_NODE_ASSIGNMENT",
              "created_ms": 1770211318056
            },
            {
              "state": "SUBMITTED_TO_WORKER",
              "created_ms": 1770211318063
            },
            {
              "state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY",
              "created_ms": 1770211318063
            },
            {
              "state": "RUNNING",
              "created_ms": 1770211318063
            },
            {
              "state": "FINISHED",
              "created_ms": 1770211318150
            }
          ],
          "required_resources": {},
          "func_or_class_name": "Counter.increment",
          "task_log_info": null,
          "label_selector": {},
          "name": "Counter.increment",
          "profiling_data": {
            "component_type": "worker",
            "component_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
            "node_ip_address": "10.244.0.48",
            "events": [
              {
                "start_time": 1770211318064.0505,
                "end_time": 1770211318064.0513,
                "extra_data": {},
                "event_name": "task:deserialize_arguments"
              },
              {
                "start_time": 1770211318064.055,
                "end_time": 1770211318064.0867,
                "extra_data": {},
                "event_name": "task:execute"
              },
              {
                "start_time": 1770211318064.0889,
                "end_time": 1770211318149.785,
                "extra_data": {},
                "event_name": "task:store_outputs"
              },
              {
                "start_time": 1770211318064.0417,
                "end_time": 1770211318149.7983,
                "extra_data": {
                  "name": "increment",
                  "task_id": "e5cbd90b7f1fb776051aa2759ceb4431ad03962e02000000"
                },
                "event_name": "task::Counter.increment"
              }
            ]
          },
          "end_time_ms": 1770211318150,
          "state": "FINISHED",
          "is_debugger_paused": null,
          "call_site": null,
          "worker_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
          "type": "ACTOR_TASK",
          "error_type": null,
          "runtime_env_info": {
            "serialized_runtime_env": "{}",
            "runtime_env_config": {
              "setup_timeout_seconds": 600,
              "eager_install": true,
              "log_files": []
            }
          },
          "creation_time_ms": 1770211318056,
          "actor_id": "051aa2759ceb4431ad03962e02000000",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "worker_pid": 241,
          "placement_group_id": null,
          "start_time_ms": 1770211318063,
          "error_message": null,
          "task_id": "e5cbd90b7f1fb776051aa2759ceb4431ad03962e02000000",
          "node_id": "49029ecd45f2139d33149e51b8732b5573bbca0f813e50145134d4db"
        },
        {
          "attempt_number": 0,
          "language": "PYTHON",
          "job_id": "02000000",
          "events": [
            {
              "state": "PENDING_ARGS_AVAIL",
              "created_ms": 1770211318056
            },
            {
              "state": "PENDING_NODE_ASSIGNMENT",
              "created_ms": 1770211318057
            },
            {
              "state": "RUNNING",
              "created_ms": 1770211318060
            },
            {
              "state": "FINISHED",
              "created_ms": 1770211318063
            }
          ],
          "required_resources": {
            "CPU": 0.5
          },
          "func_or_class_name": "Counter.__init__",
          "task_log_info": null,
          "label_selector": {},
          "name": "Counter.__init__",
          "profiling_data": {
            "component_type": "worker",
            "component_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
            "node_ip_address": "10.244.0.48",
            "events": [
              {
                "start_time": 1770211318062.4192,
                "end_time": 1770211318062.42,
                "extra_data": {},
                "event_name": "task:deserialize_arguments"
              },
              {
                "start_time": 1770211318062.424,
                "end_time": 1770211318062.431,
                "extra_data": {},
                "event_name": "task:execute"
              },
              {
                "start_time": 1770211318062.4353,
                "end_time": 1770211318062.4365,
                "extra_data": {},
                "event_name": "task:store_outputs"
              },
              {
                "start_time": 1770211318062.4097,
                "end_time": 1770211318062.44,
                "extra_data": {
                  "name": "__init__",
                  "task_id": "ffffffffffffffff051aa2759ceb4431ad03962e02000000"
                },
                "event_name": "task::Counter.__init__"
              }
            ]
          },
          "end_time_ms": 1770211318063,
          "state": "FINISHED",
          "is_debugger_paused": null,
          "call_site": null,
          "worker_id": null,
          "type": "ACTOR_CREATION_TASK",
          "error_type": null,
          "runtime_env_info": {
            "serialized_runtime_env": "{}",
            "runtime_env_config": {
              "setup_timeout_seconds": 600,
              "eager_install": true,
              "log_files": []
            }
          },
          "creation_time_ms": 1770211318056,
          "actor_id": "051aa2759ceb4431ad03962e02000000",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "worker_pid": 241,
          "placement_group_id": null,
          "start_time_ms": 1770211318060,
          "error_message": null,
          "task_id": "ffffffffffffffff051aa2759ceb4431ad03962e02000000",
          "node_id": null
        }
      ],
      "partial_failure_warning": "",
      "warnings": null
    }
  }
}
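
In the live-cluster sample above, `creation_time_ms` equals the timestamp of the first event, `start_time_ms` equals the RUNNING event, and `end_time_ms` equals the FINISHED event. The sketch below reproduces that pattern from an `events` list. It is inferred from the sample payload only, and `StateEvent` and `Timestamps` are hypothetical names rather than this PR's implementation:

```go
package main

import "fmt"

// StateEvent is a hypothetical lifecycle entry: a state name plus a timestamp.
type StateEvent struct {
	State     string
	CreatedMs int64
}

// Timestamps derives creation, start, and end times from events ordered by
// time. A zero value means the corresponding state was never reached.
func Timestamps(events []StateEvent) (creation, start, end int64) {
	for i, e := range events {
		if i == 0 {
			creation = e.CreatedMs
		}
		switch e.State {
		case "RUNNING":
			start = e.CreatedMs
		case "FINISHED", "FAILED":
			end = e.CreatedMs
		}
	}
	return creation, start, end
}

func main() {
	// Events copied from the my_task entry in the live-cluster response.
	events := []StateEvent{
		{"PENDING_ARGS_AVAIL", 1770211317649},
		{"PENDING_NODE_ASSIGNMENT", 1770211317649},
		{"SUBMITTED_TO_WORKER", 1770211317969},
		{"RUNNING", 1770211317970},
		{"FINISHED", 1770211318053},
	}
	fmt.Println(Timestamps(events))
}
```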

Dead Cluster

{
  "result": true,
  "msg": "",
  "data": {
    "result": {
      "total": 6,
      "num_after_truncation": 6,
      "num_filtered": 4,
      "result": [
        {
          "actor_id": "f6080df37a35848b2468441a02000000",
          "attempt_number": 0,
          "call_site": null,
          "creation_time_ms": 1770211244699,
          "end_time_ms": 1770211244700,
          "error_message": null,
          "error_type": null,
          "events": [
            {
              "created_ms": 1770211244699,
              "state": "PENDING_ARGS_AVAIL"
            },
            {
              "created_ms": 1770211244699,
              "state": "PENDING_NODE_ASSIGNMENT"
            },
            {
              "created_ms": 1770211244699,
              "state": "SUBMITTED_TO_WORKER"
            },
            {
              "created_ms": 1770211244700,
              "state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
            },
            {
              "created_ms": 1770211244700,
              "state": "RUNNING"
            },
            {
              "created_ms": 1770211244700,
              "state": "FINISHED"
            }
          ],
          "func_or_class_name": "Counter.get_count",
          "is_debugger_paused": null,
          "job_id": "02000000",
          "label_selector": {},
          "language": "PYTHON",
          "name": "Counter.get_count",
          "node_id": "3c5bf314a66f86ebfeee453d409b8318fa46f636a12a7a0a23feecab",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "placement_group_id": null,
          "required_resources": {},
          "runtime_env_info": {
            "runtime_env_config": {
              "eager_install": true,
              "log_files": [],
              "setup_timeout_seconds": 600
            },
            "serialized_runtime_env": "{}"
          },
          "start_time_ms": 1770211244700,
          "state": "FINISHED",
          "task_id": "39088be3736e590af6080df37a35848b2468441a02000000",
          "task_log_info": null,
          "type": "ACTOR_TASK",
          "worker_id": "a095c2237c3c8d2bd1bd834117e4d4b89abf7fc904a1bf6cbda4af5d",
          "worker_pid": 240
        },
        {
          "actor_id": "",
          "attempt_number": 0,
          "call_site": null,
          "creation_time_ms": 1770211244350,
          "end_time_ms": 1770211244615,
          "error_message": null,
          "error_type": null,
          "events": [
            {
              "created_ms": 1770211244350,
              "state": "PENDING_ARGS_AVAIL"
            },
            {
              "created_ms": 1770211244350,
              "state": "PENDING_NODE_ASSIGNMENT"
            },
            {
              "created_ms": 1770211244536,
              "state": "SUBMITTED_TO_WORKER"
            },
            {
              "created_ms": 1770211244536,
              "state": "RUNNING"
            },
            {
              "created_ms": 1770211244615,
              "state": "FINISHED"
            }
          ],
          "func_or_class_name": "my_task",
          "is_debugger_paused": null,
          "job_id": "02000000",
          "label_selector": {},
          "language": "PYTHON",
          "name": "my_task",
          "node_id": "3c5bf314a66f86ebfeee453d409b8318fa46f636a12a7a0a23feecab",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "placement_group_id": null,
          "required_resources": {
            "CPU": 0.5
          },
          "runtime_env_info": {
            "runtime_env_config": {
              "eager_install": true,
              "log_files": [],
              "setup_timeout_seconds": 600
            },
            "serialized_runtime_env": "{}"
          },
          "start_time_ms": 1770211244536,
          "state": "FINISHED",
          "task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000",
          "task_log_info": null,
          "type": "NORMAL_TASK",
          "worker_id": "2f051136bf328c5e163175e6d00dd74c7f12ddafee86cdf513e1b2bf",
          "worker_pid": 241
        },
        {
          "actor_id": "f6080df37a35848b2468441a02000000",
          "attempt_number": 0,
          "call_site": null,
          "creation_time_ms": 1770211244618,
          "end_time_ms": 1770211244699,
          "error_message": null,
          "error_type": null,
          "events": [
            {
              "created_ms": 1770211244618,
              "state": "PENDING_ARGS_AVAIL"
            },
            {
              "created_ms": 1770211244618,
              "state": "PENDING_NODE_ASSIGNMENT"
            },
            {
              "created_ms": 1770211244623,
              "state": "SUBMITTED_TO_WORKER"
            },
            {
              "created_ms": 1770211244624,
              "state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
            },
            {
              "created_ms": 1770211244624,
              "state": "RUNNING"
            },
            {
              "created_ms": 1770211244699,
              "state": "FINISHED"
            }
          ],
          "func_or_class_name": "Counter.increment",
          "is_debugger_paused": null,
          "job_id": "02000000",
          "label_selector": {},
          "language": "PYTHON",
          "name": "Counter.increment",
          "node_id": "3c5bf314a66f86ebfeee453d409b8318fa46f636a12a7a0a23feecab",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "placement_group_id": null,
          "required_resources": {},
          "runtime_env_info": {
            "runtime_env_config": {
              "eager_install": true,
              "log_files": [],
              "setup_timeout_seconds": 600
            },
            "serialized_runtime_env": "{}"
          },
          "start_time_ms": 1770211244624,
          "state": "FINISHED",
          "task_id": "e5cbd90b7f1fb776f6080df37a35848b2468441a02000000",
          "task_log_info": null,
          "type": "ACTOR_TASK",
          "worker_id": "a095c2237c3c8d2bd1bd834117e4d4b89abf7fc904a1bf6cbda4af5d",
          "worker_pid": 240
        },
        {
          "actor_id": "",
          "attempt_number": 0,
          "call_site": null,
          "creation_time_ms": 1770211244618,
          "end_time_ms": 1770211244623,
          "error_message": null,
          "error_type": null,
          "events": [
            {
              "created_ms": 1770211244618,
              "state": "PENDING_ARGS_AVAIL"
            },
            {
              "created_ms": 1770211244619,
              "state": "PENDING_NODE_ASSIGNMENT"
            },
            {
              "created_ms": 1770211244621,
              "state": "RUNNING"
            },
            {
              "created_ms": 1770211244623,
              "state": "FINISHED"
            }
          ],
          "func_or_class_name": "Counter.__init__",
          "is_debugger_paused": null,
          "job_id": "02000000",
          "label_selector": {},
          "language": "PYTHON",
          "name": "Counter.__init__",
          "node_id": "",
          "parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
          "placement_group_id": null,
          "required_resources": {
            "CPU": 0.5
          },
          "runtime_env_info": {
            "runtime_env_config": {
              "eager_install": true,
              "log_files": [],
              "setup_timeout_seconds": 600
            },
            "serialized_runtime_env": "{}"
          },
          "start_time_ms": 1770211244621,
          "state": "FINISHED",
          "task_id": "fffffffffffffffff6080df37a35848b2468441a02000000",
          "task_log_info": null,
          "type": "ACTOR_CREATION_TASK",
          "worker_id": "",
          "worker_pid": 240
        }
      ],
      "partial_failure_warning": "",
      "warnings": null
    }
  }
}

Taking Counter.get_count as an example, the only remaining discrepancy is the profiling_data field. We plan to address this once support for TASK_PROFILE_EVENT processing is available (see #4437).
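
In the sample entries above, `creation_time_ms`, `start_time_ms`, and `end_time_ms` coincide with the timestamps of the first, `RUNNING`, and `FINISHED` entries in `events`. The Python sketch below captures that relationship as inferred from this sample output only; it is not taken from the history server code. `derive_times` is a hypothetical helper, and treating `FAILED` as a second terminal state is an assumption.

```python
from typing import Optional

# Assumption: FINISHED and FAILED are the terminal states that set end_time_ms.
TERMINAL_STATES = {"FINISHED", "FAILED"}


def derive_times(events: list[dict]) -> dict[str, Optional[int]]:
    """Derive creation/start/end timestamps from one task attempt's events."""
    times: dict[str, Optional[int]] = {
        "creation_time_ms": None,
        "start_time_ms": None,
        "end_time_ms": None,
    }
    # sorted() is stable, so events sharing a timestamp keep their given order.
    for event in sorted(events, key=lambda e: e["created_ms"]):
        if times["creation_time_ms"] is None:
            times["creation_time_ms"] = event["created_ms"]
        if event["state"] == "RUNNING" and times["start_time_ms"] is None:
            times["start_time_ms"] = event["created_ms"]
        if event["state"] in TERMINAL_STATES:
            times["end_time_ms"] = event["created_ms"]
    return times


# Events copied from the NORMAL_TASK entry (my_task) in the response above.
events = [
    {"created_ms": 1770211244350, "state": "PENDING_ARGS_AVAIL"},
    {"created_ms": 1770211244350, "state": "PENDING_NODE_ASSIGNMENT"},
    {"created_ms": 1770211244536, "state": "SUBMITTED_TO_WORKER"},
    {"created_ms": 1770211244536, "state": "RUNNING"},
    {"created_ms": 1770211244615, "state": "FINISHED"},
]
print(derive_times(events))
```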

Follow-ups

1. Historical Replay

Currently, lifecycle-related fields from TASK_LIFECYCLE_EVENT are overwritten in place whenever a new event is ingested for the same task attempt. As a result, users cannot replay a task's full history or see each of its state transitions as it happened.

TODO: Preserve timestamped task states for each task attempt.
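
One possible shape for this TODO, sketched in Python for illustration only: append each transition to a per-attempt list instead of overwriting it. `TaskAttemptHistory`, `StateTransition`, and their methods are hypothetical names, not part of this PR or the history server.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateTransition:
    state: str
    created_ms: int


class TaskAttemptHistory:
    """Keeps every (state, timestamp) pair per (task_id, attempt_number)."""

    def __init__(self) -> None:
        self._history: dict[tuple[str, int], list[StateTransition]] = defaultdict(list)

    def ingest(self, task_id: str, attempt_number: int, state: str, created_ms: int) -> None:
        transitions = self._history[(task_id, attempt_number)]
        transition = StateTransition(state, created_ms)
        if transition in transitions:
            return  # ignore a re-delivered event
        transitions.append(transition)
        # Stable sort: transitions with equal timestamps keep arrival order.
        transitions.sort(key=lambda t: t.created_ms)

    def transitions(self, task_id: str, attempt_number: int) -> list[StateTransition]:
        return list(self._history.get((task_id, attempt_number), []))

    def latest_state(self, task_id: str, attempt_number: int) -> Optional[str]:
        transitions = self._history.get((task_id, attempt_number), [])
        return transitions[-1].state if transitions else None


history = TaskAttemptHistory()
task_id = "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000"
history.ingest(task_id, 0, "RUNNING", 1770211244536)  # arrives out of order
history.ingest(task_id, 0, "PENDING_ARGS_AVAIL", 1770211244350)
history.ingest(task_id, 0, "FINISHED", 1770211244615)
history.ingest(task_id, 0, "RUNNING", 1770211244536)  # duplicate delivery
print(history.transitions(task_id, 0))
```

Keeping the full list means the "last snapshot" is still available as the final element, so the current response shape could be derived from the stored history rather than replaced by it.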

2. Filters and Query Parameters

Different state entities (e.g., nodes, tasks, actors) may expose different sets of filterable fields in their GET APIs. Each entity could declare its own filterable fields (similar to this example) while reusing the shared filtering helpers.
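
A minimal Python sketch of that idea, for illustration only: each entity declares its filterable keys, and a single helper applies the (key, predicate, value) triple. The field sets, the `FILTERABLE_FIELDS` and `PREDICATES` tables, and `apply_filter` are all hypothetical; the assumption that only `=` and `!=` are supported is also mine, not stated in this PR.

```python
import operator

# Assumption: only equality and inequality predicates are supported.
PREDICATES = {"=": operator.eq, "!=": operator.ne}

# Hypothetical per-entity declarations of which keys may be filtered on.
FILTERABLE_FIELDS = {
    "tasks": {"task_id", "job_id", "state", "type", "name", "actor_id"},
    "actors": {"actor_id", "job_id", "state"},
}


def apply_filter(entity: str, records: list[dict], key: str, predicate: str, value: str) -> list[dict]:
    """Shared helper: filter records of any entity by one (key, predicate, value) triple."""
    if key not in FILTERABLE_FIELDS[entity]:
        raise ValueError(f"{key!r} is not filterable for {entity}")
    if predicate not in PREDICATES:
        raise ValueError(f"unsupported predicate {predicate!r}")
    compare = PREDICATES[predicate]
    # Query-string values arrive as strings, so compare string forms.
    return [r for r in records if compare(str(r.get(key, "")), value)]


# Trimmed-down records based on the response above.
tasks = [
    {"name": "my_task", "type": "NORMAL_TASK", "state": "FINISHED"},
    {"name": "Counter.increment", "type": "ACTOR_TASK", "state": "FINISHED"},
    {"name": "Counter.__init__", "type": "ACTOR_CREATION_TASK", "state": "FINISHED"},
]
print(apply_filter("tasks", tasks, "type", "=", "ACTOR_TASK"))
```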

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
@JiangJiaWei1103
Contributor Author

Once this PR goes through the final pass, I'll revert the first commit used for local dev. Thanks.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Member

@win5923 win5923 left a comment

LGTM! Thanks for your effort.

Member

@Future-Outlier Future-Outlier left a comment

chatted with @JiangJiaWei1103 offline, LGTM, tks!

@Future-Outlier
Member

cc @rueian to merge, thank you!

Development

Successfully merging this pull request may close these issues.

[Feature][history server] support endpoint /api/v0/tasks (type=ACTOR_TASK)

3 participants